Gibbs-type random probability measures and the exchangeable random partitions they induce
represent the subject of a rich and active literature. They provide a probabilistic framework for a 
wide range of theoretical and applied problems that are typically referred to as species sampling 
problems. In this paper, we consider the class of looking-backward species sampling problems 
introduced in Lijoi et al. (Ann. Appl. Probab. 18 (2008) 1519-1547) in Bayesian nonparametrics.
Specifically, given some information on the random partition induced by an initial sample from 
a Gibbs-type random probability measure, we study the conditional distributions of statistics 
related to the old species, namely those species detected in the initial sample and possibly
re-observed in an additional sample. The proposed results contribute to the analysis of conditional
properties of Gibbs-type exchangeable random partitions, so far focused mainly on statistics 
related to those species generated by the additional sample and not already detected in the 
initial sample. 

Keywords: Bayesian nonparametrics; conditional random partitions; Ewens-Pitman sampling 
1. Introduction

Let $\mathbb{X}$ be a complete and separable metric space equipped with the Borel $\sigma$-algebra $\mathscr{X}$,
and let $(X_i)_{i\geq 1}$ be an exchangeable sequence of $\mathbb{X}$-valued random variables defined on
some probability space $(\Omega,\mathscr{F},\mathbb{P})$. According to the celebrated de Finetti representation
theorem, there exists a random probability measure $P$ on $\mathbb{X}$ such that, conditionally on
$P$, the random variables $(X_i)_{i\geq 1}$ are independent and identically distributed according
to $P$, that is,

$$X_i \mid P \overset{\text{iid}}{\sim} P, \qquad P \sim \Pi.$$

The distribution $\Pi$ is commonly known as the de Finetti probability measure of $(X_i)_{i\geq 1}$
and it takes on the interpretation of the prior distribution in Bayesian nonparametrics.
In the present paper, we consider almost surely discrete random probability measures,
namely $P$ is such that $\Pi[P\in\mathcal{P}]=1$, where $\mathcal{P}$ stands for the set of discrete probability
measures on $(\mathbb{X},\mathscr{X})$.

If $P$ is discrete almost surely, we expect ties in a sample $(X_1,\ldots,X_n)$ from $P$; that
is, we expect $K_n\leq n$ distinct observations with frequencies $\mathbf{N}_n=(N_1,\ldots,N_{K_n})$ satisfying
$\sum_{i=1}^{K_n}N_i=n$. Accordingly, the sample induces a random partition of the set
$\{1,\ldots,n\}$, in the sense that any two indices $i\neq j$ belong to the same partition set if and
only if $X_i=X_j$. We denote by $p_j^{(n)}(n_1,\ldots,n_j)$ the symmetric function corresponding
to the probability of any particular partition of $\{1,\ldots,n\}$ having $j$ distinct blocks with
frequencies $(n_1,\ldots,n_j)$. This function is known as the exchangeable partition probability
function (EPPF), a concept introduced in [17] as a development of earlier results in
[12]. The EPPF can be specified for every $n\geq 1$ and $1\leq j\leq n$ either via the exchangeable
sequence $(X_i)_{i\geq 1}$ or by defining a random partition of $\mathbb{N}$. In the latter case, the
distribution of the random partition must satisfy certain consistency conditions and a
symmetry property that guarantees exchangeability. See [19] and references therein for
a comprehensive account on EPPFs.
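As a small illustration of how a sample induces a partition of $\{1,\ldots,n\}$, the following sketch groups indices by observed value and reads off $K_n$ and the frequencies. The function name `partition_summary` and the toy sample are illustrative assumptions, not part of the paper; blocks are listed in order of first appearance, which is only one possible labelling convention.

```python
def partition_summary(sample):
    """Group the indices 1..n of a sample by observed value.

    Returns (K_n, frequencies, blocks), with blocks listed in order of
    first appearance of each distinct value (an assumed convention).
    """
    blocks = {}
    for index, value in enumerate(sample, start=1):
        blocks.setdefault(value, []).append(index)
    frequencies = [len(block) for block in blocks.values()]
    return len(blocks), frequencies, list(blocks.values())


# Toy data for illustration only (hypothetical, not from the paper).
sample = ["a", "b", "a", "c", "b", "a"]
k_n, frequencies, blocks = partition_summary(sample)
```

Two indices share a block exactly when the corresponding observations coincide, and the frequencies always sum to the sample size.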

Exchangeable random partitions play an important role in a variety of research areas. In 
population genetics, models for exchangeable random partitions are useful for describing 
the configurations of a sample of genes into a number of distinct allelic types. See [6] and 
references therein. In machine learning, probabilistic models for linguistic applications 
are often based on clustering structures for collections of words in documents. See, for 
example, [22] and [21] for a review. In Bayesian nonparametrics, exchangeable random
partitions are commonly employed at the latent level of complex hierarchical mixture
models. See [15] and references therein for a review. Other areas of application include
storage problems, excursion theory, combinatorics, number theory and statistical physics.
Broadly speaking, exchangeable random partitions and their associated EPPFs provide
a flexible probabilistic framework for a wide range of theoretical and applied problems
that are typically referred to as species sampling problems, namely problems concerning
a population composed of individuals belonging to different species. Indeed, the number
of partition blocks takes on the interpretation of the number of distinct species in the
sample $(X_1,\ldots,X_n)$ and the $N_i$'s are the corresponding species frequencies. Given the
relevance and intuitiveness of such a framework, throughout the paper we will resort to 
the species metaphor. 

The main object of our investigation is the class of Gibbs-type exchangeable random
partitions. These are random partitions which arise by sampling from a random probability
measure, say of Gibbs-type, here denoted by $P_G$. See [18] for details. Introduced
in [10], these exchangeable random partitions represent the subject of a rich and active
literature. A recent development, first proposed in [16], is the study of their conditional
properties. This study consists in evaluating, conditional on some information about the
random partition induced by an initial sample $(X_1,\ldots,X_n)$ from $P_G$, the distribution
of certain statistics of an additional sample $(X_{n+1},\ldots,X_{n+m})$. In particular, in [16] the
main focus is on the conditional distributions of statistics related to the new species,
namely those species generated by the additional sample and not coinciding with species
already detected in the initial sample. A representative example is given by the distribution
of the number of new distinct species generated by $(X_{n+1},\ldots,X_{n+m})$, conditional
on the information of both the number of distinct species in $(X_1,\ldots,X_n)$ and their
corresponding frequencies. See [8] for a generalization to the number of new distinct species
with a certain frequency of interest. As shown in [8, 13] and [16], these conditional
distributions have direct applications in Bayesian nonparametric analysis of species sampling
problems arising in ecology and genomics. We refer to [3, 4, 7] and [11] for other
contributions at the interface between Bayesian nonparametrics and Gibbs-type exchangeable
random partitions.

Many problems in the conditional analysis of Gibbs-type exchangeable random
partitions remain unresolved. For instance, [16] pointed out the practical interest in the
conditional distributions of statistics related to the old species, namely those species
detected in the initial sample and possibly re-observed in the additional sample. Two
illustrative examples are given in Proposition 4 of [16] and in Theorem 3 of [8]. In
general, the class of species sampling problems concerning old species has been referred to
as looking-backward and it represents the focus of the present paper. We study two novel,
and practically applicable, looking-backward species sampling problems. In particular,
we derive

(i) the conditional distribution of the number of old distinct species re-observed in
$(X_{n+1},\ldots,X_{n+m})$, given complete or incomplete information on the random
partition induced by $(X_1,\ldots,X_n)$;

(ii) the conditional distribution of the number of old distinct species re-observed with
a specific frequency of interest in $(X_{n+1},\ldots,X_{n+m})$, given complete or incomplete
information on the random partition induced by $(X_1,\ldots,X_n)$.

Specifically, by complete information we refer jointly to the number of distinct species
in $(X_1,\ldots,X_n)$ and their frequencies, whereas by incomplete information we refer solely
to the number of distinct species in $(X_1,\ldots,X_n)$. Besides the sets of complete and
incomplete information, we also consider almost-complete information. This information
refers jointly to the number of distinct species in $(X_1,\ldots,X_n)$ and a subset of their
corresponding frequencies.

The present paper broadens the scope of previous literature on conditional distributions
for Gibbs-type exchangeable random partitions, by investigating in depth some
statistics related to old species. In the framework of Gibbs-type exchangeable random
partitions, looking-backward problems create a distinction between conditioning on complete,
incomplete and almost-complete information, which to the best of our knowledge
has not been dealt with explicitly in previous studies. We expect the results introduced
here to have an impact in the analysis of Bayesian nonparametric models for species
sampling problems, which have acquired increasingly complex forms in recent years to
meet the demands of scientific applications. The paper is structured as follows. Section 2
recalls the definition of Gibbs-type exchangeable random partition and introduces
preliminary results relevant to the analysis of their conditional structure. Section 3 deals
with the looking-backward species sampling problems (i) and (ii) in the general case
of Gibbs-type exchangeable random partitions and in the special case of the celebrated
Ewens-Pitman sampling model. The context of almost-complete information is also dealt
with in Section 3. Section 4 contains some numerical illustrations of the present results.
Proofs are deferred to the Appendix.


2. Preliminaries and main definitions 


Gibbs-type exchangeable random partitions were introduced in [10] and further investigated
in [18]. This class of exchangeable random partitions is characterized by an EPPF
with a product form, a feature which is crucial for mathematical tractability and, in
particular, facilitates intuition. Let $\mathcal{D}_{n,j}=\{(n_1,\ldots,n_j): n_i\geq 1 \text{ and } \sum_{i=1}^{j}n_i=n\}$ be the
set of the partitions of $n\geq 1$ into $j\leq n$ positive integers. Moreover, for any $x>0$ and
any positive integer $n$, we denote by $(x)_{n\uparrow 1}$ and $(x)_{n\downarrow 1}$ the rising factorials and falling
factorials, respectively.

Definition 2.1. Let $(X_i)_{i\geq 1}$ be an exchangeable sequence directed by $P_G$. Then, the
exchangeable random partition induced by $(X_i)_{i\geq 1}$ is said to be of Gibbs-type and it is
characterized by an EPPF of the form

$$p_j^{(n)}(n_1,\ldots,n_j)=V_{n,j}\prod_{i=1}^{j}(1-\sigma)_{(n_i-1)\uparrow 1}, \tag{2.1}$$

for $\sigma<1$ and nonnegative weights $(V_{n,j})_{j\leq n,n\geq 1}$ satisfying the recursion $V_{n,j}=V_{n+1,j+1}+(n-j\sigma)V_{n+1,j}$, with $V_{1,1}=1$.


Gibbs-type exchangeable random partitions are completely specified by the parameter
$\sigma<1$ and the collection of weights $(V_{n,j})_{j\leq n,n\geq 1}$ satisfying a backward recursion. Note
that Definition 2.1 also provides the distribution of the number $K_n$ of distinct species in
a sample of size $n$ from $P_G$, that is,

$$\mathbb{P}[K_n=j]=V_{n,j}\frac{\mathscr{C}(n,j;\sigma)}{\sigma^{j}}, \tag{2.2}$$

with $\mathscr{C}(n,j;\sigma)$ being the so-called generalized factorial coefficient. We refer to [2] for
details. The next example recalls the Ewens-Pitman sampling model, a noteworthy
example of Gibbs-type exchangeable random partition introduced in [17] and generalizing
the celebrated Ewens sampling model in [5]. See [1] and references therein for a
comprehensive account on the Ewens sampling model. Another notable Gibbs-type exchangeable
random partition, still related to the Ewens-Pitman sampling model, has been recently
introduced and investigated in [9].


Example 2.1. For any $\sigma\in(0,1)$ and $\theta>-\sigma$, the Ewens-Pitman sampling model is a
Gibbs-type exchangeable random partition with weights $(V_{n,j})_{j\leq n,n\geq 1}$ of the following
form

$$V_{n,j}=\frac{\prod_{i=0}^{j-1}(\theta+i\sigma)}{(\theta)_{n\uparrow 1}}. \tag{2.3}$$

The Ewens sampling model with parameter $\theta>0$ is recovered from the Ewens-Pitman
sampling model by letting $\sigma\to 0$. See, for example, [17] and [20] for details and further
developments.
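The recursion in Definition 2.1 and the distribution (2.2) can be checked numerically for the Ewens-Pitman model. The sketch below is a minimal illustration in exact rational arithmetic. It assumes the standard product form of the Ewens-Pitman weights, $V_{n,j}=\prod_{i=0}^{j-1}(\theta+i\sigma)/(\theta)_{n\uparrow 1}$, and the usual explicit sum for the generalized factorial coefficient, $\mathscr{C}(n,j;\sigma)=\frac{1}{j!}\sum_{i=0}^{j}(-1)^{i}\binom{j}{i}(-i\sigma)_{n\uparrow 1}$; the function names and the parameter values are illustrative choices, not taken from the paper.

```python
from fractions import Fraction
from math import comb, factorial


def rising(x, n):
    """Rising factorial (x)_{n up 1} = x (x+1) ... (x+n-1), with (x)_0 = 1."""
    out = Fraction(1)
    for i in range(n):
        out *= x + i
    return out


def gen_fact_coef(n, j, sigma):
    """Generalized factorial coefficient C(n, j; sigma) via the explicit sum
    (1/j!) * sum_i (-1)^i binom(j, i) (-i sigma)_{n up 1} (assumed form)."""
    total = sum((-1) ** i * comb(j, i) * rising(-i * sigma, n) for i in range(j + 1))
    return total / factorial(j)


def ep_weight(n, j, sigma, theta):
    """Ewens-Pitman weight V_{n,j} (assumed standard product form)."""
    numerator = Fraction(1)
    for i in range(j):
        numerator *= theta + i * sigma
    return numerator / rising(theta, n)


def prob_k(n, j, sigma, theta):
    """P[K_n = j] = V_{n,j} C(n, j; sigma) / sigma^j, as in (2.2)."""
    return ep_weight(n, j, sigma, theta) * gen_fact_coef(n, j, sigma) / sigma ** j


# Illustrative parameter values (hypothetical).
sigma, theta, n = Fraction(1, 2), Fraction(1), 6

# Backward recursion V_{m,j} = V_{m+1,j+1} + (m - j sigma) V_{m+1,j}.
recursion_ok = all(
    ep_weight(m, j, sigma, theta)
    == ep_weight(m + 1, j + 1, sigma, theta)
    + (m - j * sigma) * ep_weight(m + 1, j, sigma, theta)
    for m in range(1, n)
    for j in range(1, m + 1)
)

# The probabilities in (2.2) should add up to one over j = 1..n.
total_mass = sum(prob_k(n, j, sigma, theta) for j in range(1, n + 1))
```

Exact fractions are used so that the two identities can be compared with equality rather than up to floating-point tolerance; for large $n$ a log-scale implementation would be preferable.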


The recursion in Definition 2.1, for a fixed $\sigma$, cannot be solved in a unique way. The
solutions form a convex set where each element is the distribution of an exchangeable
random partition. Theorem 12 in [10] describes the extreme points of such a convex set.
For any $n\geq 1$ let

$$c_n(\sigma)=\begin{cases}1 & \text{if } \sigma\in(-\infty,0),\\ \log(n) & \text{if } \sigma=0,\\ n^{\sigma} & \text{if } \sigma\in(0,1).\end{cases}$$

For every Gibbs-type exchangeable random partition there exists a positive and almost
surely finite random variable $S_\sigma$ such that

$$\frac{K_n}{c_n(\sigma)}\overset{\text{a.s.}}{\longrightarrow}S_\sigma$$

as $n\to+\infty$, and such that a Gibbs-type exchangeable random partition is a unique
mixture over $\varkappa$ of extreme exchangeable random partitions for which $S_\sigma=\varkappa$ almost
surely. For $\sigma\in(-\infty,0)$ the extremes are Ewens-Pitman sampling models with parameter
$(\sigma,-\sigma\varkappa)$; for $\sigma=0$ the extremes are Ewens sampling models with parameter $\varkappa>0$; for
$\sigma\in(0,1)$ the Ewens-Pitman sampling models are not extremes. See Section 6.1 in [18]
for details on $S_\sigma$.

A generalization of Definition 2.1 has been recently introduced in [16] to study
conditional properties of Gibbs-type exchangeable random partitions. To recall this
generalization a few quantities, analogous to those describing the random partition induced
by an initial sample $(X_1,\ldots,X_n)$ from $P_G$, need to be introduced. Let $X_1^*,\ldots,X_{K_n}^*$ be
the labels identifying the $K_n$ distinct species detected in the initial sample and, for any
$m\geq 1$, define

$$L_m^{(n)}=\sum_{i=1}^{m}\prod_{j=1}^{K_n}\mathbb{1}_{\{X_j^*\}^c}(X_{n+i}) \tag{2.4}$$

as the number of observations in an additional sample $(X_{n+1},\ldots,X_{n+m})$ not coinciding
with any of the $K_n$ distinct species. Denote by $K_m^{(n)}$ the number of new distinct species
generated by these $L_m^{(n)}$ observations and by $X_{K_n+1}^*,\ldots,X_{K_n+K_m^{(n)}}^*$ their corresponding
identifying labels. Therefore,

$$\mathbf{M}_{L_m^{(n)}}=(M_1,\ldots,M_{K_m^{(n)}}),$$

with

$$M_i=\sum_{j=1}^{m}\mathbb{1}_{\{X_{K_n+i}^*\}}(X_{n+j}) \tag{2.5}$$

for $i=1,\ldots,K_m^{(n)}$, are the frequencies of the new $K_m^{(n)}$ distinct species detected among
the observations of the additional sample. Analogously,

$$\mathbf{S}_{L_m^{(n)}}=(S_1,\ldots,S_{K_n}),$$

with

$$S_i=\sum_{j=1}^{m}\mathbb{1}_{\{X_i^*\}}(X_{n+j}) \tag{2.6}$$

corresponds to the number of observations, among the $m-L_m^{(n)}$ observations of the additional
sample, coinciding with the $i$th distinct old species detected in the initial sample, for
$i=1,\ldots,K_n$. As pointed out in [13], from a Bayesian nonparametric perspective the
joint conditional distribution of the random variables (2.4), (2.5), (2.6) and $K_m^{(n)}$, given
$(X_1,\ldots,X_n)$, can be interpreted as the posterior counterpart of the EPPF (2.1). This then
provides a natural framework for Bayesian nonparametric analysis of species sampling
problems.
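The quantities in (2.4), (2.5) and (2.6) are simple counting statistics of the additional sample relative to the initial one. The following sketch computes them for a toy pair of samples. The function name, the dictionary keys and the data are illustrative assumptions; labels are ordered by first appearance, which is one possible convention.

```python
from collections import Counter


def additional_sample_statistics(initial, additional):
    """Split an additional sample into contributions from old and new species.

    Returns a dict with:
      L          number of additional observations from new species, cf. (2.4)
      K_new      number of new distinct species
      M          frequencies of the new species, cf. (2.5)
      S          additional-sample frequencies of the old species, cf. (2.6)
      reobserved number of old species seen at least once again
    """
    old_labels = list(dict.fromkeys(initial))          # distinct, order of appearance
    old_set = set(old_labels)
    counts = Counter(additional)                       # missing labels count as 0
    s = [counts[label] for label in old_labels]
    new_labels = [x for x in dict.fromkeys(additional) if x not in old_set]
    m_new = [counts[label] for label in new_labels]
    return {
        "L": sum(m_new),
        "K_new": len(new_labels),
        "M": m_new,
        "S": s,
        "reobserved": sum(1 for c in s if c > 0),
    }


# Toy data for illustration only (hypothetical).
initial = ["a", "b", "a", "c"]
additional = ["b", "d", "d", "e", "a", "b"]
stats = additional_sample_statistics(initial, additional)
```

The entry `reobserved` is the kind of looking-backward statistic studied in Section 3: it depends on which old species reappear, not only on how many new ones are found.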

In [16], the main focus is on conditional distributions of statistics related to the
new species generated by $(X_{n+1},\ldots,X_{n+m})$. For instance, by suitably marginalizing
the joint conditional distribution of the random variables (2.4), (2.5), (2.6) and $K_m^{(n)}$,
given $(X_1,\ldots,X_n)$, one obtains the conditional distribution of the number of new distinct
species, namely

$$\mathbb{P}[K_m^{(n)}=k\mid K_n=j,\mathbf{N}_n=\mathbf{n}]=\frac{V_{n+m,j+k}}{V_{n,j}}\frac{\mathscr{C}(m,k;\sigma,-n+j\sigma)}{\sigma^{k}} \tag{2.7}$$

with $\mathscr{C}(n,j;\sigma,\rho)$ being the so-called noncentral generalized factorial coefficient. We refer
to [2] for details. Accordingly, the Bayesian nonparametric estimator, under quadratic
loss function, of the number of new distinct species generated by the additional sample
coincides with

$$\mathcal{K}_m^{(n)}=\mathbb{E}[K_m^{(n)}\mid K_n=j,\mathbf{N}_n=\mathbf{n}]=\mathbb{E}[K_m^{(n)}\mid K_n=j]. \tag{2.8}$$

We refer to [3, 13, 14] and [16] for applications of (2.7) and (2.8), under the choice of $V_{n,j}$
in (2.3), to Bayesian nonparametric inference for species variety in genetic experiments.
As a generalization of (2.7), Theorem 3 in [8] provides the conditional distribution, given
$(X_1,\ldots,X_n)$, of

$$\sum_{i=1}^{K_n}\mathbb{1}_{\{N_i+S_i=l\}}+\sum_{i=1}^{K_m^{(n)}}\mathbb{1}_{\{M_i=l\}}, \tag{2.9}$$

for any $l=1,\ldots,n+m$. In words, (2.9) corresponds to the number of distinct species
with frequency $l$ generated by $(X_{n+1},\ldots,X_{n+m})$. The conditional expected value of (2.9),
given $(X_1,\ldots,X_n)$, provides the Bayesian nonparametric estimator, under quadratic loss
function, of the number of distinct species with frequency $l$ generated by the additional
sample.


3. Two looking-backward probabilities 


Before presenting our results, it is worth stating the fundamental difference between
looking-backward species sampling problems and the species sampling problems investigated
in [16]. A common feature of the conditional distributions introduced in [16] is their
independence from the information on the frequencies $\mathbf{N}_n$ induced by the initial sample
$(X_1,\ldots,X_n)$. As a representative example, note that the distribution (2.7) satisfies the
following identity

$$\mathbb{P}[K_m^{(n)}=k\mid K_n=j,\mathbf{N}_n=\mathbf{n}]=\mathbb{P}[K_m^{(n)}=k\mid K_n=j].$$

Such a property of independence characterizes all the statistics concerning the new species
in the additional sample $(X_{n+1},\ldots,X_{n+m})$. Indeed, since (2.6) does not contain any
information on new species, the conditional distributions of these statistics can be obtained
from the joint conditional distribution of the random variables (2.4), (2.5) and $K_m^{(n)}$, given
$(X_1,\ldots,X_n)$. In Proposition 1 of [13], this joint conditional distribution is shown to be
independent of $\mathbf{N}_n$. Hence, $K_n$ is a sufficient statistic for the species sampling problems
discussed in [16].

In contrast, the conditional distributions of statistics concerning old species depend
on the information of both the number $K_n$ of distinct species and the corresponding
frequencies $\mathbf{N}_n$. This is to say that, letting $T_m^{(n)}$ be a statistic related to old species, in
most cases, one obtains

$$\mathbb{P}[T_m^{(n)}\in\cdot\mid K_n=j,\mathbf{N}_n=\mathbf{n}]\neq\mathbb{P}[T_m^{(n)}\in\cdot\mid K_n=j]. \tag{3.1}$$

As an example, the distribution of (2.9) satisfies (3.1). See Theorem 3 of [8] for details.
See also Proposition 4 in [16] for another example. According to (3.1), the analysis of the
looking-backward species sampling problems naturally leads us to consider at least two sets
of information on the random partition induced by $(X_1,\ldots,X_n)$: (i) complete information,
namely $K_n$ and $\mathbf{N}_n$; (ii) incomplete information, namely $K_n$. We also consider
almost-complete information, namely $K_n$ and a subset of $\mathbf{N}_n$. In the next subsections,
we present and discuss the results of our paper. We focus on deriving the conditional
distributions of two looking-backward statistics, given complete or incomplete information.
This will be the subject of Section 3.1 and Section 3.2. The conditional distributions of
these two statistics given almost-complete information can be derived through arguments
similar to those applied when conditioning on incomplete information. This will be discussed
in Section 3.3.


3.1. Probabilities of re-observing old species 


In this section, we consider the distribution of the number of old distinct species that are
re-observed in $(X_{n+1},\ldots,X_{n+m})$, conditional on complete and incomplete information
on the random partition induced by $(X_1,\ldots,X_n)$. Formally, in the context of complete
information, we are interested in the random variable $R_m^{(n,j,\mathbf{n})}$ which is defined in
distribution as

$$\mathbb{P}[R_m^{(n,j,\mathbf{n})}=x]=\mathbb{P}\left[\sum_{i=1}^{K_n}\mathbb{1}_{\{S_i>0\}}=x\,\Big|\,K_n=j,\mathbf{N}_n=\mathbf{n}\right]. \tag{3.2}$$

In the context of incomplete information, we are interested in the random variable $R_m^{(n,j)}$
which is defined in distribution as

$$\mathbb{P}[R_m^{(n,j)}=x]=\mathbb{P}\left[\sum_{i=1}^{K_n}\mathbb{1}_{\{S_i>0\}}=x\,\Big|\,K_n=j\right]. \tag{3.3}$$

In the next theorem, we derive the factorial moments of the random variables in (3.2)
and (3.3). By means of Theorem 1 in [8], we obtain (3.4). Accordingly, (3.5) follows
from (3.4) by suitably marginalizing the frequencies $\mathbf{N}_n$. These moments then lead to
the corresponding distributions by means of standard arguments involving probability
generating functions.
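The passage from factorial moments to a distribution is not spelled out here. One common route, for a random variable supported on $\{0,\ldots,N\}$, is the inversion formula $\mathbb{P}[X=x]=\sum_{r=x}^{N}(-1)^{r-x}\binom{r}{x}\,\mathbb{E}[(X)_{r\downarrow 1}]/r!$. The sketch below implements this inversion and checks it on a binomial distribution, whose factorial moments are $(N)_{r\downarrow 1}p^{r}$. It is meant as an illustration of the generic argument under these assumptions, not as the authors' own derivation; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def falling(x, r):
    """Falling factorial (x)_{r down 1} = x (x-1) ... (x-r+1), with (x)_0 = 1."""
    out = 1
    for i in range(r):
        out *= x - i
    return out


def pmf_from_factorial_moments(moments):
    """Recover the pmf of X on {0..N} from moments[r] = E[(X)_{r down 1}], r = 0..N."""
    top = len(moments) - 1
    binomial_moments = [Fraction(mu) / factorial(r) for r, mu in enumerate(moments)]
    return [
        sum((-1) ** (r - x) * comb(r, x) * binomial_moments[r] for r in range(x, top + 1))
        for x in range(top + 1)
    ]


# Check on Binomial(N, p), an illustrative test case (hypothetical values).
N, p = 4, Fraction(1, 3)
moments = [falling(N, r) * p ** r for r in range(N + 1)]
pmf = pmf_from_factorial_moments(moments)
expected = [comb(N, x) * p ** x * (1 - p) ** (N - x) for x in range(N + 1)]
```

The alternating signs make this inversion numerically delicate in floating point when the support is large, which is why the sketch works with exact fractions.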


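Theorem 1. Let $(X_i)_{i\geq 1}$ be an exchangeable sequence directed by $P_G$. Then, for any
integer $r\geq 1$ one has

$$\mathbb{E}[(R_m^{(n,j,\mathbf{n})})_{r\downarrow 1}]=r!\sum_{v=0}^{r}(-1)^{v}\binom{j-v}{r-v}\sum_{\{c_1,\ldots,c_v\}\in\mathcal{C}_{j,v}}\sum_{k=0}^{m}\frac{V_{n+m,j+k}}{V_{n,j}}\frac{\mathscr{C}(m,k;\sigma,-n+\sum_{i=1}^{v}n_{c_i}+(j-v)\sigma)}{\sigma^{k}} \tag{3.4}$$

and

$$\mathbb{E}[(R_m^{(n,j)})_{r\downarrow 1}]=\frac{r!}{\mathscr{C}(n,j;\sigma)}\sum_{v=0}^{r}(-1)^{v}\binom{j-v}{r-v}\sum_{s=v}^{n-(j-v)}\binom{n}{s}\mathscr{C}(s,v;\sigma)\mathscr{C}(n-s,j-v;\sigma)\sum_{k=0}^{m}\frac{V_{n+m,j+k}}{V_{n,j}}\frac{\mathscr{C}(m,k;\sigma,-n+s+(j-v)\sigma)}{\sigma^{k}}, \tag{3.5}$$

where $\mathcal{C}_{j,v}$ denotes the set of the $v$-combinations (without repetitions) of the elements
$\{1,\ldots,j\}$.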

The distributions of $R_m^{(n,j,\mathbf{n})}$ and $R_m^{(n,j)}$ are interpretable as the posterior distributions
of the number of old distinct species that are re-observed in $(X_{n+1},\ldots,X_{n+m})$ given,
respectively, complete and incomplete information on the random partition induced by
$(X_1,\ldots,X_n)$. Accordingly, the Bayesian nonparametric estimators, under a quadratic
loss function, coincide with the expected values of the random variables $R_m^{(n,j,\mathbf{n})}$ and
$R_m^{(n,j)}$. An expression for these Bayesian nonparametric estimators, denoted by
$\mathcal{R}_m^{(n,j,\mathbf{n})}=\mathbb{E}[R_m^{(n,j,\mathbf{n})}]$ and $\mathcal{R}_m^{(n,j)}=\mathbb{E}[R_m^{(n,j)}]$, is presented in the next corollary. See Proposition 1 and
Proposition 2 for an expression of these estimators under the Ewens-Pitman sampling
model.


Corollary 3.1. The Bayesian nonparametric estimator of the number of old distinct
species that are re-observed in an additional sample of size $m$, given complete information
on $(X_1,\ldots,X_n)$, coincides with

$$\mathcal{R}_m^{(n,j,\mathbf{n})}=j-\sum_{i=1}^{n}m_i\sum_{k=0}^{m}\frac{V_{n+m,j+k}}{V_{n,j}}\frac{\mathscr{C}(m,k;\sigma,-n+i+(j-1)\sigma)}{\sigma^{k}}.$$

Moreover, given incomplete information on $(X_1,\ldots,X_n)$, the Bayesian nonparametric
estimator coincides with

$$\mathcal{R}_m^{(n,j)}=j-\frac{1}{\mathscr{C}(n,j;\sigma)}\sum_{s=1}^{n-(j-1)}\binom{n}{s}\mathscr{C}(s,1;\sigma)\mathscr{C}(n-s,j-1;\sigma)\sum_{k=0}^{m}\frac{V_{n+m,j+k}}{V_{n,j}}\frac{\mathscr{C}(m,k;\sigma,-n+s+(j-1)\sigma)}{\sigma^{k}}.$$

Here $m_i\geq 0$ denotes the number of distinct species observed in the initial sample with
frequency $i$.

The distributions of $R_m^{(n,j,\mathbf{n})}$ and $R_m^{(n,j)}$, under the Ewens-Pitman sampling model,
are specified in the next propositions. We devote special attention to the Ewens-Pitman
sampling model because it has proven suitable for inference in species sampling problems,
particularly in genomics. See, for example, [13] and [8] for details. The corresponding
results for the Ewens sampling model can be recovered by letting $\sigma\to 0$ and applying
equation 2.63 in [2].


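Proposition 1. Under the Ewens-Pitman sampling model, the distribution of $R_m^{(n,j,\mathbf{n})}$
coincides with

$$\mathbb{P}[R_m^{(n,j,\mathbf{n})}=x]=\frac{1}{(\theta+n)_{m\uparrow 1}}\sum_{v=j-x}^{j}\binom{v}{j-x}(-1)^{v-(j-x)}\sum_{\{c_1,\ldots,c_v\}\in\mathcal{C}_{j,v}}\left(\theta+n-\sum_{i=1}^{v}n_{c_i}+\sigma v\right)_{m\uparrow 1} \tag{3.6}$$

and

$$\mathcal{R}_m^{(n,j,\mathbf{n})}=j-\frac{1}{(\theta+n)_{m\uparrow 1}}\sum_{i=1}^{n}m_i(\theta+n-i+\sigma)_{m\uparrow 1}. \tag{3.7}$$

The random variable $R_m^{(n,j,\mathbf{n})}$ assigns positive probability to any integer value $x$ such that
$0\leq x\leq\min(j,m)$.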
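Under the Ewens-Pitman model, the probability that a given set of $v$ old species with total frequency $s$ is entirely missed by $m$ further observations is $(\theta+n-s+\sigma v)_{m\uparrow 1}/(\theta+n)_{m\uparrow 1}$, and the distribution of the number of re-observed species then follows by inclusion-exclusion over such sets. The sketch below evaluates the resulting probability mass function and the mean in the form of (3.7), in exact arithmetic. It is an illustrative implementation under these assumptions, with hypothetical function names and toy inputs; it enumerates all subsets of species and is therefore only practical for small $j$.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def rising(x, n):
    """Rising factorial (x)_{n up 1}, with (x)_0 = 1."""
    out = Fraction(1)
    for i in range(n):
        out *= x + i
    return out


def reobserved_pmf(freqs, m, sigma, theta):
    """P[R = x], x = 0..j, for the number of old species re-observed in m draws."""
    n, j = sum(freqs), len(freqs)
    denom = rising(theta + n, m)
    # miss[v]: sum over v-subsets of old species of the (unnormalized)
    # probability that every species in the subset is not re-observed.
    miss = [
        sum(rising(theta + n - sum(c) + sigma * v, m) for c in combinations(freqs, v))
        for v in range(j + 1)
    ]
    pmf = []
    for x in range(j + 1):
        u = j - x  # number of old species that are NOT re-observed
        pmf.append(
            sum((-1) ** (v - u) * comb(v, u) * miss[v] for v in range(u, j + 1)) / denom
        )
    return pmf


def reobserved_mean(freqs, m, sigma, theta):
    """Expected number of re-observed old species, in the form of (3.7)."""
    n, j = sum(freqs), len(freqs)
    return j - sum(rising(theta + n - f + sigma, m) for f in freqs) / rising(theta + n, m)


# Toy inputs for illustration only (hypothetical).
freqs, m = [3, 1, 1], 4
sigma, theta = Fraction(1, 2), Fraction(1)
pmf = reobserved_pmf(freqs, m, sigma, theta)
mean = reobserved_mean(freqs, m, sigma, theta)
```

For $m=1$ the mean reduces to $(n-j\sigma)/(\theta+n)$, the probability that the next observation belongs to an old species, which gives a quick sanity check.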


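Proposition 2. Under the Ewens-Pitman sampling model, the distribution of $R_m^{(n,j)}$
coincides with

$$\mathbb{P}[R_m^{(n,j)}=x]=\frac{1}{\mathscr{C}(n,j;\sigma)(\theta+n)_{m\uparrow 1}}\sum_{v=j-x}^{j}\binom{v}{j-x}(-1)^{v-(j-x)}\sum_{s=v}^{n-(j-v)}\binom{n}{s}(\theta+n-s+v\sigma)_{m\uparrow 1}\mathscr{C}(s,v;\sigma)\mathscr{C}(n-s,j-v;\sigma) \tag{3.8}$$

and

$$\mathcal{R}_m^{(n,j)}=j-\frac{1}{\mathscr{C}(n,j;\sigma)(\theta+n)_{m\uparrow 1}}\sum_{s=1}^{n-(j-1)}\binom{n}{s}(\theta+n-s+\sigma)_{m\uparrow 1}\mathscr{C}(s,1;\sigma)\mathscr{C}(n-s,j-1;\sigma). \tag{3.9}$$

The random variable $R_m^{(n,j)}$ assigns positive probability to any integer value $x$ such that
$0\leq x\leq\min(j,m)$.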
3.2. Probabilities of re-observing old species with a certain frequency

In this section, we consider the distribution of the number of old distinct species that are
re-observed in $(X_{n+1},\ldots,X_{n+m})$ with frequency $0\leq l\leq m$, conditional on complete and
incomplete information on the random partition induced by the initial observed sample
$(X_1,\ldots,X_n)$. Note that the case $l=0$ is of particular interest, representing the number
of old distinct species that are not re-observed in the additional sample. Formally, in the
context of complete information, we are interested in the random variable $R_{l,m}^{(n,j,\mathbf{n})}$ which
is defined in distribution as

$$\mathbb{P}[R_{l,m}^{(n,j,\mathbf{n})}=x]=\mathbb{P}\left[\sum_{i=1}^{K_n}\mathbb{1}_{\{S_i=l\}}=x\,\Big|\,K_n=j,\mathbf{N}_n=\mathbf{n}\right]. \tag{3.10}$$

In the context of incomplete information, we are interested in the random variable $R_{l,m}^{(n,j)}$
which is defined in distribution as

$$\mathbb{P}[R_{l,m}^{(n,j)}=x]=\mathbb{P}\left[\sum_{i=1}^{K_n}\mathbb{1}_{\{S_i=l\}}=x\,\Big|\,K_n=j\right]. \tag{3.11}$$

In the next theorem, we derive the factorial moments of the random variables in (3.10)
and (3.11). The factorial moment (3.12) is obtained by a direct application of Theorem 1
in [8]. With regard to the factorial moment (3.13), this is obtained from (3.12)
by suitably marginalizing the frequencies $\mathbf{N}_n$. Again, these factorial moments lead to
the corresponding distributions by means of standard arguments involving probability
generating functions.
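Theorem 2. Let $(X_i)_{i\geq 1}$ be an exchangeable sequence directed by $P_G$. Then, for any
$0\leq l\leq m$ and any integer $r\geq 1$ one has

$$\mathbb{E}[(R_{l,m}^{(n,j,\mathbf{n})})_{r\downarrow 1}]=r!\,\frac{m!}{(l!)^{r}(m-rl)!}\sum_{\{c_1,\ldots,c_r\}\in\mathcal{C}_{j,r}}\prod_{i=1}^{r}(n_{c_i}-\sigma)_{l\uparrow 1}\sum_{k=0}^{m}\frac{V_{n+m,j+k}}{V_{n,j}}\frac{\mathscr{C}(m-rl,k;\sigma,-n+\sum_{i=1}^{r}n_{c_i}+(j-r)\sigma)}{\sigma^{k}} \tag{3.12}$$

and

$$\mathbb{E}[(R_{l,m}^{(n,j)})_{r\downarrow 1}]=r!\,\frac{m!}{(l!)^{r}(m-rl)!}\,\frac{1}{\mathscr{C}(n,j;\sigma)}\left(\frac{\sigma(1-\sigma)_{l\uparrow 1}}{\sigma-l}\right)^{r}\sum_{s=r}^{n-(j-r)}\binom{n}{s}\mathscr{C}(s,r;\sigma-l)\mathscr{C}(n-s,j-r;\sigma)\sum_{k=0}^{m}\frac{V_{n+m,j+k}}{V_{n,j}}\frac{\mathscr{C}(m-rl,k;\sigma,-n+s+(j-r)\sigma)}{\sigma^{k}}, \tag{3.13}$$

where $\mathcal{C}_{j,r}$ denotes the set of the $r$-combinations (without repetitions) of the elements
$\{1,\ldots,j\}$.

Again, the distributions of $R_{l,m}^{(n,j,\mathbf{n})}$ and $R_{l,m}^{(n,j)}$ are interpretable as the posterior
distributions of the number of old distinct species that are re-observed in $(X_{n+1},\ldots,X_{n+m})$
with frequency $0\leq l\leq m$ given, respectively, complete and incomplete information on the
random partition induced by $(X_1,\ldots,X_n)$. The corresponding Bayesian nonparametric
estimators, denoted by $\mathcal{R}_{l,m}^{(n,j,\mathbf{n})}=\mathbb{E}[R_{l,m}^{(n,j,\mathbf{n})}]$ and $\mathcal{R}_{l,m}^{(n,j)}=\mathbb{E}[R_{l,m}^{(n,j)}]$, are specified in the
next corollary. See Proposition 3 and Proposition 4 for an expression for these estimators
under the Ewens-Pitman sampling model.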


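Corollary 3.2. The Bayesian nonparametric estimator of the number of old distinct
species that are re-observed, with frequency $0\leq l\leq m$, in an additional sample of size $m$,
given complete information on $(X_1,\ldots,X_n)$, coincides with

$$\mathcal{R}_{l,m}^{(n,j,\mathbf{n})}=\binom{m}{l}\sum_{i=1}^{n}m_i(i-\sigma)_{l\uparrow 1}\sum_{k=0}^{m}\frac{V_{n+m,j+k}}{V_{n,j}}\frac{\mathscr{C}(m-l,k;\sigma,-n+i+(j-1)\sigma)}{\sigma^{k}}.$$

Moreover, given incomplete information on $(X_1,\ldots,X_n)$, the Bayesian nonparametric
estimator coincides with

$$\mathcal{R}_{l,m}^{(n,j)}=\binom{m}{l}\frac{\sigma(1-\sigma)_{l\uparrow 1}}{(\sigma-l)\,\mathscr{C}(n,j;\sigma)}\sum_{s=1}^{n-(j-1)}\binom{n}{s}\mathscr{C}(s,1;\sigma-l)\mathscr{C}(n-s,j-1;\sigma)\sum_{k=0}^{m}\frac{V_{n+m,j+k}}{V_{n,j}}\frac{\mathscr{C}(m-l,k;\sigma,-n+s+(j-1)\sigma)}{\sigma^{k}}.$$

Here $m_i\geq 0$ denotes the number of distinct species observed in the initial sample with
frequency $i$.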


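Finally, the distributions of $R_{l,m}^{(n,j,\mathbf{n})}$ and $R_{l,m}^{(n,j)}$, under the Ewens-Pitman sampling
model, are specified in the next propositions.

Proposition 3. Under the Ewens-Pitman sampling model, for any $0\leq l\leq m$, the
distribution of $R_{l,m}^{(n,j,\mathbf{n})}$ coincides with

$$\mathbb{P}[R_{l,m}^{(n,j,\mathbf{n})}=x]=\frac{1}{(\theta+n)_{m\uparrow 1}}\sum_{y=x}^{j}(-1)^{y-x}\binom{y}{x}\binom{m}{l,\ldots,l,m-yl}\sum_{\{c_1,\ldots,c_y\}\in\mathcal{C}_{j,y}}\prod_{i=1}^{y}(n_{c_i}-\sigma)_{l\uparrow 1}\left(\theta+n-\sum_{i=1}^{y}n_{c_i}+\sigma y\right)_{(m-yl)\uparrow 1} \tag{3.14}$$

and

$$\mathcal{R}_{l,m}^{(n,j,\mathbf{n})}=\frac{1}{(\theta+n)_{m\uparrow 1}}\binom{m}{l}\sum_{i=1}^{n}m_i(i-\sigma)_{l\uparrow 1}(\theta+n-i+\sigma)_{(m-l)\uparrow 1}. \tag{3.15}$$

The random variable $R_{l,m}^{(n,j,\mathbf{n})}$ assigns positive probability to any integer value $x$ such that
$0\leq x\leq\min(j,m)$.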
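Under the Ewens-Pitman model, an old species with initial frequency $i$ is seen exactly $l$ times among $m$ further observations with probability $\binom{m}{l}(i-\sigma)_{l\uparrow 1}(\theta+n-i+\sigma)_{(m-l)\uparrow 1}/(\theta+n)_{m\uparrow 1}$, and summing this over the old species gives the expected number of old species re-observed with frequency $l$. The sketch below evaluates that sum in exact arithmetic and checks that the expected counts over $l=0,\ldots,m$ add up to $j$. Function names and inputs are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction
from math import comb


def rising(x, n):
    """Rising factorial (x)_{n up 1}, with (x)_0 = 1."""
    out = Fraction(1)
    for i in range(n):
        out *= x + i
    return out


def expected_with_frequency(freqs, m, l, sigma, theta):
    """Expected number of old species observed exactly l times in m new draws."""
    n = sum(freqs)
    total = sum(
        rising(f - sigma, l) * rising(theta + n - f + sigma, m - l) for f in freqs
    )
    return comb(m, l) * total / rising(theta + n, m)


# Toy inputs for illustration only (hypothetical).
freqs, m = [4, 2, 1, 1], 5
sigma, theta = Fraction(1, 2), Fraction(3)
means = [expected_with_frequency(freqs, m, l, sigma, theta) for l in range(m + 1)]
expected_reobserved = len(freqs) - means[0]   # old species seen at least once
```

The case $l=0$ counts the old species that are not re-observed, so subtracting it from $j$ recovers the estimator of Section 3.1.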

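Proposition 4. Under the Ewens-Pitman sampling model, for any $0\leq l\leq m$, the
distribution of $R_{l,m}^{(n,j)}$ coincides with

$$\mathbb{P}[R_{l,m}^{(n,j)}=x]=\frac{1}{\mathscr{C}(n,j;\sigma)(\theta+n)_{m\uparrow 1}}\sum_{y=x}^{j}(-1)^{y-x}\binom{y}{x}\binom{m}{l,\ldots,l,m-yl}\left(\frac{\sigma(1-\sigma)_{l\uparrow 1}}{\sigma-l}\right)^{y}\sum_{s=y}^{n-(j-y)}\binom{n}{s}(\theta+n-s+\sigma y)_{(m-yl)\uparrow 1}\mathscr{C}(s,y;\sigma-l)\mathscr{C}(n-s,j-y;\sigma) \tag{3.16}$$

and

$$\mathcal{R}_{l,m}^{(n,j)}=\binom{m}{l}\frac{\sigma(1-\sigma)_{l\uparrow 1}}{(\sigma-l)\,\mathscr{C}(n,j;\sigma)(\theta+n)_{m\uparrow 1}}\sum_{s=1}^{n-(j-1)}\binom{n}{s}(\theta+n-s+\sigma)_{(m-l)\uparrow 1}\mathscr{C}(s,1;\sigma-l)\mathscr{C}(n-s,j-1;\sigma). \tag{3.17}$$

The random variable $R_{l,m}^{(n,j)}$ assigns positive probability to any integer value $x$ such that
$0\leq x\leq\min(j,m)$.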
3.3. Conditioning on almost-complete information


We now consider the distribution of the number of old distinct species that are re-observed
in the additional sample $(X_{n+1},\ldots,X_{n+m})$, conditional on almost-complete information.
This looking-backward species sampling problem can be seen as a generalization of the
problems discussed above. For any integer $p\in\{1,\ldots,K_n\}$ let $\tau=\{\tau_1,\ldots,\tau_p\}$ be a
collection of integers such that $1\leq\tau_1<\cdots<\tau_p\leq K_n$ and define the subset of $p$ frequencies
$\mathbf{N}_{\tau,n}:=(N_{\tau_1},\ldots,N_{\tau_p})$. In the context of almost-complete information, we are interested
in the random variables $R_m^{(n,j,\mathbf{n}_\tau)}$ and $R_{l,m}^{(n,j,\mathbf{n}_\tau)}$ which are defined in distribution as

$$\mathbb{P}[R_m^{(n,j,\mathbf{n}_\tau)}=x]=\mathbb{P}\left[\sum_{i=1}^{K_n}\mathbb{1}_{\{S_i>0\}}=x\,\Big|\,K_n=j,\mathbf{N}_{\tau,n}=\mathbf{n}_\tau\right] \tag{3.18}$$

and

$$\mathbb{P}[R_{l,m}^{(n,j,\mathbf{n}_\tau)}=x]=\mathbb{P}\left[\sum_{i=1}^{K_n}\mathbb{1}_{\{S_i=l\}}=x\,\Big|\,K_n=j,\mathbf{N}_{\tau,n}=\mathbf{n}_\tau\right]. \tag{3.19}$$

The following lemma is fundamental in determining the factorial moments of the random
variables introduced in (3.18) and (3.19) and, accordingly, in deriving the corresponding
distributions.


Lemma 3.1. Let $(X_i)_{i\geq 1}$ be an exchangeable sequence directed by a Gibbs-type random
probability measure $P_G$. For any integer $p\in\{1,\ldots,K_n\}$, denote by $\nu=\{\nu_1,\ldots,\nu_{K_n-p}\}$
the complement set of $\tau$ with $1\leq\nu_1<\cdots<\nu_{K_n-p}\leq K_n$ and define the subset of
frequencies $\mathbf{N}_{\nu,n}:=(N_{\nu_1},\ldots,N_{\nu_{K_n-p}})$. Then


1P[N^,„ = nv\Kn = j, Np,„ = rip] 


(3.20) 








i)ti- 


The random variable $(\mathbf{N}_{\nu,n}=\mathbf{n}_\nu\mid K_n=j,\mathbf{N}_{\tau,n}=\mathbf{n}_\tau)$ assigns positive probability to the
set $\mathcal{D}_{n-\sum_{i=1}^{p}n_{\tau_i},\,j-p}$.


The factorial moments of $R_m^{(n,j,\mathbf{n}_\tau)}$ and $R_{l,m}^{(n,j,\mathbf{n}_\tau)}$ are derived by means of Lemma 3.1
and along lines similar to the proof of Theorem 1 and Theorem 2, respectively. In
particular, with regard to the factorial moments of the random variables in (3.18), one has

$$\mathbb{E}[(R_m^{(n,j,\mathbf{n}_\tau)})_{r\downarrow 1}]=\frac{r!}{\mathscr{C}(n-\sum_{i=1}^{p}n_{\tau_i},j-p;\sigma)}\sum_{v_1=0}^{r}\sum_{v_2=0}^{r-v_1}(-1)^{v_1+v_2}\binom{j-v_1-v_2}{r-v_1-v_2}$$
$$\times\sum_{\{d_1,\ldots,d_{v_1}\}\in\mathcal{C}_{p,v_1}}\sum_{s=v_2}^{n-\sum_{i=1}^{p}n_{\tau_i}-(j-p-v_2)}\binom{n-\sum_{i=1}^{p}n_{\tau_i}}{s}\mathscr{C}(s,v_2;\sigma)\mathscr{C}\left(n-\sum_{i=1}^{p}n_{\tau_i}-s,j-p-v_2;\sigma\right)$$
$$\times\sum_{k=0}^{m}\frac{V_{n+m,j+k}}{V_{n,j}}\frac{\mathscr{C}(m,k;\sigma,-n+\sum_{i=1}^{v_1}n_{\tau_{d_i}}+(j-v_1-v_2)\sigma+s)}{\sigma^{k}}. \tag{3.21}$$

We point out that (3.21) is a generalization of both the results stated in Theorem 1.
Indeed, by setting $\tau=\emptyset$ in (3.21) one obtains (3.5), whereas by setting $p=j$ in (3.21)
one obtains (3.4). With regard to the factorial moments of the random variables in (3.19),
one has


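$$\mathbb{E}[(R_{l,m}^{(n,j,\mathbf{n}_\tau)})_{r\downarrow 1}]=\frac{r!}{\mathscr{C}(n-\sum_{i=1}^{p}n_{\tau_i},j-p;\sigma)}\frac{m!}{(l!)^{r}(m-rl)!}\sum_{v=0}^{r}\left(\frac{\sigma(1-\sigma)_{l\uparrow 1}}{\sigma-l}\right)^{r-v}\sum_{\{d_1,\ldots,d_v\}\in\mathcal{C}_{p,v}}\prod_{i=1}^{v}(n_{\tau_{d_i}}-\sigma)_{l\uparrow 1}$$
$$\times\sum_{s=r-v}^{n-\sum_{i=1}^{p}n_{\tau_i}-(j-p-(r-v))}\binom{n-\sum_{i=1}^{p}n_{\tau_i}}{s}\mathscr{C}(s,r-v;\sigma-l)\mathscr{C}\left(n-\sum_{i=1}^{p}n_{\tau_i}-s,j-p-(r-v);\sigma\right)$$
$$\times\sum_{k=0}^{m}\frac{V_{n+m,j+k}}{V_{n,j}}\frac{\mathscr{C}(m-rl,k;\sigma,-n+\sum_{i=1}^{v}n_{\tau_{d_i}}+s+(j-r)\sigma)}{\sigma^{k}}. \tag{3.22}$$

Note that (3.22) includes as special cases both the results stated in Theorem 2. Indeed, by
setting $\tau=\emptyset$ in (3.22) one obtains (3.13), whereas by setting $p=j$ in (3.22) one obtains
(3.12).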


4. Numerical illustrations 

We can now apply the derived conditional results which are interpretable, from a Bayesian 
nonparametric standpoint, as estimators or predictions. The range of problems to be
addressed can be delineated using the following hypothetical setting. A nineteenth century
naturalist samples a number of marine species in an expedition to a remote island,
reporting in his notebook the number of distinct species sampled and their frequencies. We
are interested in estimating the abundance of a particular species observed at that point
in time. If all the data in the notebook are available, the looking-backward estimators of 
Theorems 1 and 2 which condition on complete information can be applied to solve this 
problem. Now suppose that certain critical pages of the notebook are missing, and the 
only datum available is the number of distinct species in a sample of known size. This 
corresponds to the setting of incomplete information. 

In a general application, the species could be words in a text, mutations of a gene in a 
population, or the names of newborns in a year. The availability of complete or incomplete 
information could be determined by constraints of the experimental method used or, 
in the case of a meta-analysis, restrictions of access to data. For example, techniques 
routinely used in biology provide indications about presence or absence of a particular 
species, say a particular bacterium or a genetic mutation of interest, but are not suitable 
for measuring the relative species abundance. The experimental techniques, in these cases, 
produce datasets with partial information. 

We illustrate an application of the derived looking-backward estimators in a simulation 
study. An initial sample of size $n = 2000$ was simulated from the Ewens-Pitman sampling 
model with $\theta = 100$ and $\sigma = 0.5$. The top row of panels in Figure 1 shows the 
conditional expectations of the number of re-observed species in an additional experiment 
with sample sizes ranging from 0 to 4000. These two panels display discrepancies between 
the estimates under complete and partial information and illustrate sensitivity to the 
choice of the parameters $\theta$ and $\sigma$. The estimates were computed across a range 
of possible prior parameters, including those of the true data distribution. Interestingly, 
the divergence between the two estimators depends more heavily on $\sigma$ and is minimized 
when the parameters match those of the true data distribution. We refer to [13] for detailed 
arguments on practical selection of the prior parameters in this model. The second row of 
panels, in contrast, displays estimates for the number of new species in the additional 
sample. In this case the estimates are identical under complete and partial information. 
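
The simulation setup can be mimicked with a short script. What follows is a minimal sketch, not the code behind Figure 1: it draws an initial sample through the sequential predictive rule of the Ewens-Pitman model (a new species with probability $(\theta + \sigma k)/(\theta + n)$, an old species $i$ with probability $(n_i - \sigma)/(\theta + n)$) and then counts how many old species reappear in an additional sample. The function names and the Monte Carlo averaging are illustrative assumptions; the looking-backward estimators themselves are given by the formulas of Section 3, not by simulation.

```python
import random


def extend_partition(counts, m, theta, sigma, rng):
    """Add m draws to a partition using the Ewens-Pitman predictive rule.

    counts[i] is the current frequency of species i; a copy is returned.
    Assumes theta > 0 and 0 <= sigma < 1.
    """
    counts = list(counts)
    total = sum(counts)
    for _ in range(m):
        k = len(counts)
        u = rng.random() * (theta + total)
        new_mass = theta + sigma * k  # unnormalised weight of a new species
        if u < new_mass:
            counts.append(1)
        else:
            u -= new_mass
            for i, c in enumerate(counts):
                u -= c - sigma  # unnormalised weight of old species i
                if u < 0:
                    counts[i] += 1
                    break
            else:
                counts[-1] += 1  # guard against floating-point round-off
        total += 1
    return counts


def count_reobserved(initial, m, theta, sigma, rng):
    """Number of species of the initial sample seen again in m further draws."""
    extended = extend_partition(initial, m, theta, sigma, rng)
    return sum(1 for old, new in zip(initial, extended) if new > old)


rng = random.Random(1)
initial = extend_partition([], 2000, theta=100.0, sigma=0.5, rng=rng)
runs = [count_reobserved(initial, 500, 100.0, 0.5, rng) for _ in range(10)]
print("distinct species in the initial sample:", len(initial))
print("Monte Carlo mean of re-observed old species:", sum(runs) / len(runs))
```

Conditioning on the full list `initial` corresponds to complete information; a user who only records `len(initial)` is in the incomplete-information setting.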

Figure 2 considers simulated data that have not been sampled from the Ewens-Pitman 
sampling model. Here, the sample was generated from a zeta distribution, whose power 
law behavior is common in applications, and analyses were still performed using the 
Ewens-Pitman sampling model. Looking-backward estimators under complete and incomplete 
information are displayed for several prior parameter values. These are consistent with 
the relationship between the choice of the model parameters and the resulting conditional 
expectations shown in Figure 1. Figure 2 also displays (black line) the conditional 
expectations under the true zeta sampling model, assumed unknown to the investigator. 
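
A sample of this kind can be generated with a standard rejection sampler. The sketch below assumes that the parameter 1.3 quoted for Figure 2 is the exponent $s$ of the zeta law $P(X = k) \propto k^{-s}$, $k = 1, 2, \ldots$; the function name `sample_zeta` is hypothetical, and the algorithm is the classical rejection scheme attributed to Devroye, not necessarily the one used by the authors.

```python
import math
import random
from collections import Counter


def sample_zeta(s, rng):
    """Draw one value from P(X = k) proportional to k**(-s), for s > 1."""
    b = 2.0 ** (s - 1.0)
    while True:
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids division by zero
        v = rng.random()
        x = math.floor(u ** (-1.0 / (s - 1.0)))  # heavy-tailed proposal
        t = (1.0 + 1.0 / x) ** (s - 1.0)
        if v * x * (t - 1.0) / (b - 1.0) <= t / b:  # acceptance test
            return x


rng = random.Random(2)
sample = [sample_zeta(1.3, rng) for _ in range(2000)]
# Species frequencies (complete information) and their number (incomplete).
frequencies = sorted(Counter(sample).values(), reverse=True)
print("distinct species:", len(frequencies))
print("largest frequencies:", frequencies[:5])
```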

The simulations in Figure 1 were iterated, generating 1000 independent datasets of size 
$n = 2000$ from the Ewens-Pitman sampling model with $\theta = 100$ and $\sigma = 0.5$. 
Figure 3 shows the distribution of the estimator for the number of distinct old species 
re-observed in an additional sample of size 500. The blue and red histograms correspond 
to the estimator under complete and incomplete information, respectively. As expected, the 
estimators have the same mean but the estimator fit to complete information has slightly 
higher variance. 
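
The equality of the means can be read off the tower property of conditional expectation. As a sketch, assume that both estimators are conditional expectations computed under the data-generating parameters, and write $R$ for the number of old species re-observed in the additional sample, $\mathbf{N}_n$ for the vector of species frequencies in the initial sample (complete information) and $K_n$ for its number of distinct species (incomplete information); this shorthand is introduced here only for illustration. Since $K_n$ is a function of $\mathbf{N}_n$,

$$
\mathbb{E}[R \mid K_n] = \mathbb{E}\bigl[\mathbb{E}[R \mid \mathbf{N}_n] \,\big|\, K_n\bigr],
\qquad
\operatorname{Var}\bigl(\mathbb{E}[R \mid \mathbf{N}_n]\bigr)
= \operatorname{Var}\bigl(\mathbb{E}[R \mid K_n]\bigr)
+ \mathbb{E}\bigl[\operatorname{Var}\bigl(\mathbb{E}[R \mid \mathbf{N}_n] \,\big|\, K_n\bigr)\bigr].
$$

Both estimators therefore have mean $\mathbb{E}[R]$, and the variance of the complete-information estimator exceeds that of the incomplete-information estimator by a nonnegative term.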

Figure 1. Estimators for the number of old and new distinct species observed as a function of 
the size $m$ of the additional sample. An initial sample of $n = 2000$ steps was drawn from the 
Ewens-Pitman sampling model with $\theta = 100$ and $\sigma = 0.5$. The top panels show estimators 
for the number of old species under complete information and under incomplete information; the 
bottom panels show the estimator for the number of new species. The panels on the left (titled 
$\theta = 100$) show estimators computed under $\theta = 100$ and allowing $\sigma$ to vary. The 
panels on the right (titled $\sigma = 0.5$) show estimators computed under $\sigma = 0.5$ and 
allowing $\theta$ to vary. 

Appendix 

A.1. Proofs of the results in Section 3.1 

Proof of Theorem 1. With regard to the rth factorial moment in (3.4), this is obtained 
by a direct application of Theorem 1 in [8]. Indeed, by means of Vandermonde's 
identity one has 

Figure 2. Estimators for the number of old distinct species observed as a function of the size $m$ 
of the additional sample. An initial sample of $n = 2000$ steps was drawn from a zeta distribution 
with scale parameter 1.3. Each panel (titled $\theta = 50$, $\theta = 100$, $\theta = 200$ and 
$\theta = 400$) shows estimators computed with a fixed $\theta$ and allowing $\sigma$ to vary. The 
black line in each panel shows the expected number of old distinct species in the sampling model. 


Theorem 1 in [8] then leads to (3.4) by taking the expected value of both sides of (A.1). 
This completes the first part of the proof. With regard to the second part of the theorem, 
by combining (3.4) with the distributions displayed in (2.1) and (2.2), we write the rth 
factorial moment as 








{ni,...,nj)G'Dn 


/il, . . . , fljJ 


(A.2) 
Figure 3. Histograms of the estimators for the number of old species (additional sample of 
size 500) under complete information and under incomplete information. To construct the 
histograms, these estimators were computed conditional on 1000 independent initial samples of 
length $n = 2000$ each, which were drawn from the Ewens-Pitman sampling model with 
$\theta = 100$ and $\sigma = 0.5$. 


E E 


Vn+rn,i+k ^ {m,k\a,-U+ (j 
V ■ fT^ 

* n,j ^ 


v)(t) 


and prove that it coincides with (3.5). The proof is mainly devoted to solving the sums over 
the indexes $n_1,\ldots,n_j$ and $c_1,\ldots,c_v$. Once these sums are solved, (3.5) follows by 
some algebra involving factorial numbers and noncentral generalized factorial coefficients. 
By means of equation 2.61 in [2], and using the fact that $\mathcal{C}_{j,v}$ has cardinality 
$\binom{j}{v}$, from (A.2) one has 




rili 




^{n,j]u) 


E 








A;=0 17=0 

n—j+1 n-j+l-(si-l) '^-i+l-Er=i^ 

X E E - E 


Si — l 


S2 — 1 




E v 

i= 


I 


V 

X n(i ~ o-)(si-i)ti 

i=l 


(A.3) 
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X CT, -n + ^ s, + (j - v)aj ^ 

In order to solve the nested sums over the indexes $s_1,\ldots,s_v$ in (A.3), we first deal with 
the sum over the index $s_v$ and then we introduce a suitable recursive argument for solving 
the remaining sums over the indexes $s_1,\ldots,s_{v-1}$. First, recall that for any $x > 0$ and 
$0 < y < x$, for any $a > 0$, $b > 0$, $c > 0$ and for any real number $d$ one has the following 
identity 

$$
\binom{y+c}{y}\,\mathscr{C}(x, y+c; d, a+b)
= \sum_{j=y}^{x-c} \binom{x}{j}\,\mathscr{C}(j, y; d, a)\,\mathscr{C}(x-j, c; d, b). \tag{A.4}
$$
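
The convolution identity $\binom{y+c}{y}\mathscr{C}(x,y+c;d,a+b)=\sum_{j}\binom{x}{j}\mathscr{C}(j,y;d,a)\mathscr{C}(x-j,c;d,b)$ can be checked numerically in exact arithmetic. The sketch below assumes the explicit representation $\mathscr{C}(n,k;s,r)=\frac{1}{k!}\sum_{i=0}^{k}(-1)^i\binom{k}{i}(-is-r)_{n\uparrow 1}$ for the noncentral generalized factorial coefficient, with $(a)_{n\uparrow 1}$ the rising factorial; this convention and the helper names `rising` and `gfc` are assumptions made for illustration and should be compared with the definition in Section 2.

```python
from fractions import Fraction
from math import comb, factorial


def rising(a, n):
    """Rising factorial a (a + 1) ... (a + n - 1), equal to 1 when n = 0."""
    out = Fraction(1)
    for i in range(n):
        out *= a + i
    return out


def gfc(n, k, s, r=Fraction(0)):
    """Noncentral generalized factorial coefficient C(n, k; s, r) (assumed form)."""
    total = sum((-1) ** i * comb(k, i) * rising(-i * s - r, n)
                for i in range(k + 1))
    return total / factorial(k)


def lhs(x, y, c, d, a, b):
    return comb(y + c, y) * gfc(x, y + c, d, a + b)


def rhs(x, y, c, d, a, b):
    # Terms with j < y or x - j < c vanish, so the range can be restricted.
    return sum(comb(x, j) * gfc(j, y, d, a) * gfc(x - j, c, d, b)
               for j in range(y, x - c + 1))


d, a, b = Fraction(1, 2), Fraction(3, 2), Fraction(-7, 3)
ok = all(lhs(x, y, c, d, a, b) == rhs(x, y, c, d, a, b)
         for x in range(7) for y in range(x + 1) for c in range(x - y + 1))
print("identity holds on the tested grid:", ok)
```

Using `Fraction` keeps every quantity exact, so equality can be tested without a numerical tolerance.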


See Chapter 2 of [2] for details. Then, let us consider the sum over the index $s_v$ in (A.3), 
that is, 


E 




i)ti 


1 




1 


m. 


Si , . . . , , 72 Sj 

V \ ( ^ ^ 

k-,(T,-n + '^s^F{j -v)cr Win-'^SiJ - v;a 


2=1 


2=1 


a-? ^ \si,...,s^_i,n- 


17—1 




-i)ti 


E 

s„ = l 


E v— 1 

i=l Si 


(1 ~ f^)(s„-l)tl 


27 — 1 


21 — 1 


X m, fc;cr, —n + ^ + s„ + (j — v)a j'lfin — ^ — s„, j — n; cr . 


By a direct application of (A.4) to the coefficients 
$\mathscr{C}(m, k; \sigma, -n + \sum_{i=1}^{v} s_i + (j - v)\sigma)$ and 
$\mathscr{C}(n - \sum_{i=1}^{v} s_i, j - v; \sigma)$ we can write the last expression in the 
following expanded form 


1 

aJ-'" 


E V — 1 


27 — 1 

2=1 


m 


xE 


'^(t, k; a 


n-Ei=l 

E 

l=j-v 


E v—1 

i=l s> 


‘^(IJ -v;a,-(j -v)a) 
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Sv = l ^ ^ 


( ^;-l 

n-'^Si - Sy - {j - 
i=i 


/ (m-t)tl 

—1 


(by Vandermonde's identity to expand $(n - \sum_{i=1}^{v-1} s_i - s_v - (j - v)\sigma)_{(m-t)\uparrow 1}$) 




11(1 


i)ti 


2=1 




t—k ^ /j,=0 

u-l 


n - l^i=r s, j _(j _ 


X E 

l^j-V 


^ Ei=l / j V—v-U —1 

X E r^~^~ ^i=i '®’ 

St; —1 ^ 

X (1 - f^)(s.-i)ti(-(j - + ^)(n-i-Err7 s.-s„)ti 

(by equation 2.56 in [2] to solve the sum over the index $s_v$) 

1 

1 


crJ " Si/ 1 


-1 . n(i-^)(*^- 


l)tl 


2=1 


E( 7 )*-(a;<>)i:(’ 7 ')(-(i-.w„. 

i—/c h —0 


^ Ei=i 


1=3-11 


v — 1 


in — l — Si, 1; tr, (j — v)a — h 


providing the solution for the innermost nested sum over the index $s_v$. Therefore, 
according to the last identity, the rth factorial moment in (A.3) has the following reduced 
expression 




rili 


■E 






i?=0 




0-ii 


n-j + 1 n-j + l-(si-l) ra-i+l-ELi (si-1) 

X E E - E 


Si—l 32—1 

V — 1 

n(i -cr)(s,-i)ti 




E v — 1 
i—l 


(A.5) 


2=1 


i—k h —0 


Ei=i Si 1 ^ Y-vi;— 1 \ 

l=j-v ^ ^ 


X 


1 

crJ-J'+l 





2; —1 

{j -v)a 

i=l 



Starting from (A.5) we can now introduce a recursive argument to solve the remaining 
nested sums over the indexes $s_1,\ldots,s_{v-1}$. In particular, consider the sum over the 
index $s_{v-1}$, that is, 


s„_i = l V®lj ■ ■ • ; Y^i=l ^ij 


J - 0-> -(j - 'y)0') 


X 


t—k 
n—Yl'i=i Si —1 

X E 


l=j-v 


h—0 

■^v—l 

i 


(A.6) 


E V— I 

i=l S' 
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which can be written as 
1 




i)ti 


cri-’'+i \si,...,Sy-2,n-Yl,i=iSiJ 

i —/c h —0 

n—YliZi St —1 

X ^ -V)a) 

X 


E 


E v — 2 \ / X— \v —2 

fcl s*\ Si -S«-1 


Sj ;-1 


l-\ 


(1 - n-Z + l- ^Si- S„_ 1 , 1; cr, {j - v)o - h 


2=1 


(by (A.4) to expand ^{n — l — J2l=i Si — s„_i + 1,1; cr, (j — v)a — h)) 

v-2 


1 


CrJ *'+1 Vsi,---,S„-2,n-X;i=i 




X 


m / \ m—t / ,\ 

+—U \ ^ / h—n \ / 


t—k ^ ' h—[) 

n-YLlZl St-1 

X ^ -v)a) 


1=3 —v+l 

n-l-YZ“Zi S' 

S l-l 


2 = 1 


E v — 2 

i=l Si 


-l,z,n-l + l- z-Y,l.= 


2 = 1 


‘^(z,l;cr) 


r-^+l-z-ELf Si ^ ^-u-2 

n - / + 1 - z - Si 


E 

s^_l=l 


52; — ! 


X (1 - ff)(.„_i-i)ti(-(j - 

(by equation 2.56 in [2] to solve the sum over the index $s_{v-1}$) 

= ” y-2 )n(l-^)(s 

cri ’'+1 Vsi,..., s^_2, n - 


(st-l)tl 
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X i; (7) *’«. k; a) ’£■ (“7 (-0 - 

i —/t.— 0 
Si-1 

—D + 1 


'n.-l-YFi=i Si 


E 

2 = 1 


E t;—2 

i=i Si 


z-i,z,n-/+i-2-x;r=i 


‘^(z,l;cr) 


i;-2 


in — l + l — z — Si, 1; tr, (j — ?;)(t — h 


(by (A.4) to solve the sum over the index z) 


1/ \Sl, . . . , S-i;_2, ^ y^^-—1 




-i)ti 


xf;(7)*’(f.t;a)|:'(’"7‘)(-o--.w„. 

i —Ai h —0 

l=j-v ^ ^ 


X 


1 

crJ-ii+Z 





v-2 

l-'^s,,2;a, {j 

i=l 


v)a 



Note that the resulting expression has the same structure as the summand in (A.6). This 
fact suggests the possibility of repeating the above arguments for each of the remaining 
nested sums over the indexes $s_{v-2},\ldots,s_1$, respectively. In particular, after a repeated 
application of these arguments we can write the rth factorial moment in (A.3) as follows 




rili 


^{n,j]a) 




A;=0 


tj=0 


f;( 7 )*-(a;.)’x:‘(’" 7 ')(-(i-.v)„, 

i —Ai h —0 


(A.7) 


^ HU f 
l—j—V ^ ' 


v)a)^{n — /, v\ cr, (j — v)a — h). 
Finally, by a direct application of (A.4) to expand 
$\mathscr{C}(n - l, v; \sigma, (j - v)\sigma - h)$, we can write (A.7) as 




rili 


{'n,j;(T) 


E 

A;=0 


Vn+7n,j+k 1 (j 


(T^ 

y n.i ^ 


D=0 


r — V 




T'-U-'V) 

E 


‘^is,v;a) 


m , \ m-t , V 

^ E E E/j 

t —/c h —0 

n—s / ^ 

X E ( ]{l)(m-t-h)ni-{i+ -V]a,-{j-v)a) 

1=3-V V ® / 

which leads to (3.5) by means of (A.4) and some standard algebra involving factorial 
numbers and noncentral generalized factorial coefficients. This completes the second part 
of the proof. □ 
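
For reference, the form of Vandermonde's identity invoked repeatedly in the proof above is, in the rising factorial notation $(a)_{n\uparrow 1} = a(a+1)\cdots(a+n-1)$ (stated here as a standard fact, not copied from the source),

$$
(a+b)_{n\uparrow 1} = \sum_{k=0}^{n} \binom{n}{k}\,(a)_{k\uparrow 1}\,(b)_{(n-k)\uparrow 1}.
$$

For instance, with $n = 2$ the right-hand side is $b(b+1) + 2ab + a(a+1) = (a+b)(a+b+1)$, which is $(a+b)_{2\uparrow 1}$.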

Proof of Proposition 1. By combining the corresponding rth factorial moment in 
Theorem 1 with $V_{n,j}$ displayed in (2.3) one has 




r4.lJ 


(6 -\- n)mti 


-u^O 




X ^ Ef“+-^) ‘^(m,k;a,-n-i-'^nci-\-{j-v)a\ (A.8) 

{ci,...,c„}eCj,„ fc=o \ i=i ) 


T\ 


{9 n)mti 


v—0 




E Uci + crv 


{ci,...,c„}eC3,„ 


mtl 


where the last identity follows from equation 2.49 in [2]. Accordingly, (3.7) follows from 
(A.8) by setting r = 1. Regarding (3.6), an inversion of the generating function for the 
rth factorial moment in (A.8) leads to 




■J.n) _ 


= X\ 

1 


i9 + n)mn dt 




(A.9) 
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x+y 

xE 


v—O 


( ) 

\x + y-vj 


(-1)" E 


9 -\-n — Tie 

K i—1 


■ av 


m'l'l 


where 


dt- 




x+y 


= (-1)^(x + 2/)^4,i. 

t=0 


The proof is then completed by means of standard algebra involving factorial numbers 
and binomial coefficients. Specifically, since = 0 for any y > j — x then (A.9) can 

be written as 




{9 + n)mti “ 


E(-i)^-1„!JE(-i) 


y-x^ 


V—O 


j-V 

y-'v. 


X ^+«-E Uci + (TV 


mtl 


1 


3 J-'u 


{9 + n)mti 


(-irEE(-i)^ 


11=0 y —0 


j -v\(y + v 
y X 


E r+”"E rici + 


{ci,...,c„}eCj,„ 
1 


mfl 


{9 + n)m'[i 




D=0 


V 

X — j + v 


E r+”"E n-ci + (Jv 


{ci,...,c„}eC 3 .„ 


mfl 


which leads to (3.6) by means of standard algebraic manipulations involving factorial 
numbers. □ 
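
The inversion step that turns factorial moments into probabilities is an instance of a general identity: for a random variable $X$ on $\{0,\ldots,N\}$, $P[X = x] = \sum_{y=0}^{N-x} \frac{(-1)^y}{x!\,y!}\,E[(X)_{(x+y)\downarrow 1}]$, where $(X)_{r\downarrow 1}$ denotes the falling factorial. The following minimal sketch checks this on a binomial distribution, chosen purely for illustration; the function names are hypothetical and nothing here reproduces (A.8) or (A.9).

```python
from fractions import Fraction
from math import comb, factorial


def falling(a, r):
    """Falling factorial a (a - 1) ... (a - r + 1), equal to 1 when r = 0."""
    out = 1
    for i in range(r):
        out *= a - i
    return out


def factorial_moments(pmf):
    """E[(X)_r] for r = 0..N, where pmf[k] = P[X = k] on {0, ..., N}."""
    n = len(pmf) - 1
    return [sum(Fraction(falling(k, r)) * p for k, p in enumerate(pmf))
            for r in range(n + 1)]


def pmf_from_factorial_moments(moments):
    """Invert the factorial moments back into the probabilities P[X = x]."""
    n = len(moments) - 1
    return [sum(Fraction((-1) ** y, factorial(x) * factorial(y)) * moments[x + y]
                for y in range(n - x + 1))
            for x in range(n + 1)]


p = Fraction(1, 3)
binomial_pmf = [comb(5, k) * p ** k * (1 - p) ** (5 - k) for k in range(6)]
recovered = pmf_from_factorial_moments(factorial_moments(binomial_pmf))
print("pmf recovered exactly:", recovered == binomial_pmf)
```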

Proof of Proposition 2. A combination of the corresponding rth factorial moment in 
Theorem 1 with $V_{n,j}$ displayed in (2.3) leads to 




m 7 rilJ 


^{n,j]a){9 + n)mn ^ 
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T'-U-V) 

E 


{s,v-a)'^{n - s,j - v;a) 


E(“+'? ) '^{m,k-a,-n-\-s + {j -v)a) 

/fetl 


(A.IO) 


k^o ^ 

r\ 


^{n,j;a){0 + n)mn'^o\r-v)^ 




(j(fi' + n- s + i;cr)„f-i‘^(s, z;; (T)‘^(n - sj - v; a), 


where the last identity follows from equation 2.49 in [2]. Accordingly, (3.9) follows from 
(A.10) by setting r = 1. Regarding (3.8), an inversion of the generating function for the 
rth factorial moment in (A.10) leads to 


F[Rt^^=x] 


'^{n,j;a){0-\-n)mti ^ xl dt 

x+y / ■ \ 

xEf V-ir 

n-(j-v) 




t^o 


(A.11) 


/n\ 

E r>cr)m-f-i'^(s,u; a)'ff(n -s,j - v; a), 


where 


dt- 




x+y 


t=0 


The proof is then completed by means of standard algebra involving factorial numbers 
and binomial coefficients. Specifically, since = 0 for any y > j — x then (A. 11) 

can be written as 


F[Rt^'>=x] 
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'^in,j;cr){0^ 


' v=0 


i-i)y 


y 

y-x 
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v—O 


j -V 

y-v 


(-ir 
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S + w)mti‘^(s, v; a)‘^{n - sj 
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1 


3 3-'v 


n-ij-v) 


(-irEE(-i)" 

y—O 


j -v\ fy + v 


E (j(fi' + n- s + vo)m^i^{.s, v; a)^{n - sj - v; a) 
(-ir^(-l) 


^in,j;cr){9 + n)mn 

n-ij-v) 






X — j -\-v 


/\ 

E V, a)^{n -s,j - v; a) 


which leads to (3.8) by means of standard algebraic manipulations involving factorial 
numbers. □ 


A.2. Proofs of the results in Section 3.2 


Proof of Theorem 2. With regard to the rth factorial moment in (3.12), this is 
obtained by a direct application of Theorem 1 in [8]. This completes the first part of 
the proof. With regard to the second part, the rth factorial moment is obtained by combining 
(3.12) with the distributions displayed in (2.1) and (2.2). Specifically, we can write the 
following expression 
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\l,... — rl 




^77,1 , . . . , ?lo- 


1^(1 -cr)(„,_ 


i)n 


(A.12) 


X E Y[iric,-cr)in 

{ci.....Cr}eC,-...i=l 
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fc=0 


Vn+m,j+k ^{m-rl,k-,a,-n + Yli=inci + [j - r)g) 




n,3 


and prove that it coincides with (3.13). As in Theorem 1 the main issue consists in 
solving the sums over the collection of indexes $n_1,\ldots,n_j$ and $c_1,\ldots,c_r$. First, by 
means of equation 2.61 in [2] and using the fact that $\mathcal{C}_{j,r}$ has cardinality 
$\binom{j}{r}$, from (A.12) one has 




-w E 


hn+mj'+fc 1 

X E E - E 


Si — l S2 — 1 
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J 7 (i - o')(si-i)ti(sj ->^)in 


Sr = l 


Si, . . . , Sf, 71 X/i=l ' 
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(A.13) 
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-rl.k-^a.-n + '^Si + {j - r)a W{ n-^Si,j - r; 


a . 


i=l 


i=l 


As in the proof of Theorem 1, in order to solve the nested sums over the indexes $s_1,\ldots,s_r$ 
in (A.13) we first deal with the sum over the index $s_r$. Recall that for any $x > 0$ and 
$0 < y < x$, for any $a > 0$, $b > 0$, $c > 0$ and for any real number $d$ one has the following 
identity 

$$
\binom{y+c}{y}\,\mathscr{C}(x, y+c; d, a+b)
= \sum_{j=y}^{x-c} \binom{x}{j}\,\mathscr{C}(j, y; d, a)\,\mathscr{C}(x-j, c; d, b). \tag{A.14}
$$

See Chapter 2 of [2] for details. Then, let us consider the sum over the index $s_r$ in (A.13), 
that is, 
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; — rl, k; a, —n + ^ s^ + s,. + (j — r)a j'/fln — ^ Si + Sr, j — r;cr . 
By a direct application of (A.14) to the coefficients 
$\mathscr{C}(m - rl, k; \sigma, -n + \sum_{i=1}^{r} s_i + (j - r)\sigma)$ and 
$\mathscr{C}(n - \sum_{i=1}^{r} s_i, j - r; \sigma)$ we can write the last expression in the 
following expanded form 


1 


n 


’■ \si,...,Sr-i,n- 
^—rl 
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r —1 
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E r — i T 

i=i / _ -^r — l 

xj:r-^^W(t,k;a) j: 

t—k ^ ' z—j—r 




{z,j -r;a,-{j -r)a) 


E r — 1 

_i = l / _ _ Y^r —1 

Sr = l ^ 


X n 


-J2s,-Sr-U - r)<J ] (-(j - s,-Sr)n 

/ (m-ri-t)tl 
(by Vandermonde's identity to expand $(n - \sum_{i=1}^{r-1} s_i - s_r - (j - r)\sigma)_{(m-rl-t)\uparrow 1}$) 
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Si , ^r-1 

(i \ j n — z — s., 

x(l-a)(z_i)^i E ( 3 

Sr = l ^ ^ 


x{i- <^)sAii-U - r)<j + si-sr)n 

(by equation 2.60 in [2] to solve the sum over the index $s_r$) 

,v^ 1 

1 


erf \Si,...,Sr-l,n-J2i=l SiJ j 
— rl 


-1 «. ) “ cr)(«,_i)-,-i(si - cr)z^i 


2=1 


V- , /V- V^-W-A, , 

E t E h 

t=fe ^ / /2=0 ^ ^ 
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^ Si 1 / 

n - T,i=l Sz 


E 


{z)i^rn-ri-t-h)n‘^{z,j “ r;a,-{j - r)a) 


(1 - cr)(;_i)|i(-l)^ \n-z-^Si]l\a-l,{j-r)(j-h 


i=l 


providing the solution for the innermost nested sum over the index $s_r$. Therefore, 
according to the last identity, the rth factorial moment in (A.13) has the following 
reduced expression 






, m • 


-rl)^ 




^n-\-m,j-\-k 1 

(T^ 

* n,j ^ 


X E E - E 


Sl — 1 S2 — 1 

r — 1 


Sr-l—1 


v-^r—1 

Si, . . . , St^—I , Tl 


11(1 - cr)(s._i)-|-i(si 


(A.15) 


m—rl 


t—k 




t 


1—rl—t 


h—0 


m — rl — t 
h 




1 Ei=i Si 1 

1'^- 

z^j-r 


{z)m-ri-t-h'^{z,j - r; cr, -(j - r)a) 


^ ~ ^)d-i)ti(-l)'^ - z - y s,; 1; cr - (j - r)a - hj . 

Starting from (A.15) we can now repeatedly apply equation 2.60 in [2] to solve the 
remaining sums over the indexes $s_1,\ldots,s_{r-1}$, respectively, starting from the index 
$s_{r-1}$ and proceeding backward to the index $s_1$. As an example, consider the sum over the 
index $s_{r-1}$, that is, 


(si-i) 

E 

Sr-l — l 


Si, 


...,Sr-l,a z^i=l '’*/ 


m—rl 

E 

t—k 


m — rl 
t 


i—rl — t 


k-,cr) 


h^O 


m — rl — t 


^(-(j-r)(T)^ti 


(A.16) 
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^ Si 1 / 

n - T,i=l Sz 


E 


{z)i^rn-Ti-t-h)n‘^{z,j - r;a,-{j - r)a) 


X 


fj-? 

which can be written as 

1 


— (1 - cr)(i_i)^i(-l)‘^ in - z - ^Si]l]a - l,{j - r)a - h 


2=1 


- YJ'l ..) n“ ^ W<»' - "’"I 


m—rl 

E 

t—k 


m — rl 
t 


m—rl—t 


,k-,cr) 


h—0 


m — rl — t 
h 


^(-(j-r)a)^^i 


2 :^J —r+l ^ 


E v — 2 

i=i 

Sr-l—l 


I 1 \;^r — 2 

n-z + l- s, 

^r—1 


X (Z - cr)s^_itl‘^ n-z + l- ^Si- 3 ^.- 1 , 1; cr - Z, (j - r)cr - h 


2=1 


(by equation 2.60 in [2] to solve the sum over the index $s_{r-1}$) 


E r — 2 
2=1 


r-2 

11(1 

2=1 


iz—rl 


t—k 


m — rl 


i—rl — t 




^=0 


m — rl — t 




X E 

z—j—r+l 


E r —2 

i=l Si 
z-1 


(z - l)(^_rt-t-h)nr(z - 1, j - ?■; cr, -(j - r)a) 


^^((1 - cr)(i_i)|i)^2!(-l) V - z + 1 - ^ s,, 2; CT - (j - r)a - hj . 


The resulting expression has the same structure as the summand in (A.16). This fact 
suggests the possibility of repeating exactly the above arguments for each of the remaining 
nested sums over the indexes $s_{r-2},\ldots,s_1$, respectively. In particular, after a repeated 
application of these arguments we can write the rth factorial moment in (A.15) as 




’{n,j;cr) \l,...,l,m-rl 

m—rl / 

J: J: 




<J^ 


E- 


k^O 


(T^ 

y n.i ^ 


t—k 


m—rl—t 


h—Q 


m — rl — t 

h 




(A.17) 


n-r 

E ( 2 ) i^){m-ri-t-h)nr{z,j - r; cr, -O' - r)a)‘^{n -z,r]<T-l, {j - r)a - h). 


z=j-r 


Finally, by applying (A.14) to expand $\mathscr{C}(n - z, r; \sigma - l, (j - r)\sigma - h)$ in 
(A.17), we can write (A.17) as 




m 


‘^(n,j]a) \l,... ,l,m — rl 

n-U-r) 


^Vn+m,j+k 1 


E 


n 
s 

m—rl — t 


m—rl 




t—k 


(j3-^ 

m — rl 
t 


■E' 


T/^ (T^ 

k=o ^ 


r{t,k;a) ^ 


h—0 


m — rl — t 


^(-0-r)a)^^i 


n—s / _ \ 

E ^ J {Z)(^ra-rl-t-h)ni-ij “ 


on 


z^j-r 

xr{z,j -r;a,-{j -r)a) 


which leads to (3.13) by means of (A.14) and some standard algebra involving factorial 
numbers and noncentral generalized factorial coefficients. This completes the second part 
of the proof. □ 

Proof of Proposition 3. By combining the corresponding rth factorial moment in 
Theorem 2 with $V_{n,j}$ displayed in (2.3) one has 


r\ 


m 


{9 -^n)mn \l,---,l,m-rl 











S. Bacallado, S. Favaro and L. Trippa 


X 

e 

a 


(A.18) 


r! 


+ (j - r)c 


2=1 


(d + n}m.n \l,...,l,m-rl 


X ^ ]~7 {uci - cr)in jO + n-y^nc.+crr 

{ci,...,Cr}GCj,r ^ —1 V 1 ) 


(m— 


where the last identity follows from equation 2.49 in [2]. Accordingly, (3.15) follows from 
(A.18) by setting r = 1. With regard to (3.14), an inversion of the generating function for 
the rth factorial moment in (A.18) leads to 




= x 


.T (\t^ 


( 

m \ 

t=0 • 

1 

-f 


{9 + n)m-\i ^ 2 ;! dt 

x+y / x+y \ 

X ^ ]^(nci-cr);^i e + n-^rici+o-(a; + ?/) 

{ci^...,Cx + y}^Cj^x + y X i —1 / 


(A.19) 


(m-(x+y)0tl 


where 


^(t-iY+y 


= {-l)y{x + y)^ii. 


t=o 


Then (3.14) follows from (A.19) by means of standard algebra involving factorial numbers 
and binomial coefficients. □ 

Proof of Proposition 4. A combination of the corresponding rth factorial moment in 
Theorem 2 with $V_{n,j}$ displayed in (2.3) leads to 




1 


'^{n,j]a){6 + n)mn \l,. ■. ,l,m - rl 

n-(j-r) 




(^^'^{s,r-a-l)^{n-s,j-r-,a) 

s—r X '' 

™- / Q \ 

( — h j j '^(ra — rl,k\a,—nF s F {j — r)a) 

t-nV^ /fetl 


(A.20) 
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n-{j-r) .. 

^ XI ( ){0 + n-s + ur)( m—rl)'\l 

s—r '' 

^{s,r;a- l)^{n - s,j -r-,a), 


where the last identity follows from equation 2.49 in [2]. Accordingly, (3.17) follows from 
(A.20) by setting r = 1. With regard to (3.16), an inversion of the generating function 
for the rth factorial moment in (A.20) leads to 




'^{n,j;a){0-\-n)mti ^ 2 ;! dt 




t=0 
(a:+y) 


^(-Cr(l -cr)q_i)|i) 

—'T-l-l/ ' ' 


+ y)l 

-U-x-v) 


s=x+y 

X ^(s,a: + y;cr - l)^{n -s,j-{x + y); a) 


(A.21) 


where 


dt- 




x+V 


= (-1)^(x + 2/):j4,1. 

t =0 


Then (3.16) follows from (A.21) by means of standard algebra involving factorial numbers 
and binomial coefficients. □ 


A.3. Proofs of the results in Section 3.3 

Proof of Lemma 3.1. By suitably marginalizing the EPPF in (2.1) one obtains the 
distribution of (Ar„,N 7 -), that is the main ingredient for determining (3.20). Specifically, 
one has 


F[Kr,= j,'Nr,n = n^] 

U-pV- 


= Vr 


n,7 .1 

j'- 
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U-pV- 
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(’"-1 


jj(l-cr)(„^._ 


l)tl 






1 ' ^Ti 
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U-p) 

i=l 


(A.22) 


= v; 


U-pV- 


n,7 ■■ 

j'- 






i)Ti 




-P]<7) 


ai-P 

where the last identity is obtained by a direct application of equation 2.61 in [2]. The 
proof is completed by taking the ratio between the distributions displayed in (2.1) and 
(A.22). □ 
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